For this project, I worked with data from the 2021 Global Burden of Disease (GBD) study provided by the Institute for Health Metrics and Evaluation (IHME) (https://vizhub.healthdata.org/gbd-results/). After creating a free account, I accessed data on cardiovascular disease (CVD) mortality along with lifestyle factors such as daily fiber intake. I then merged these datasets and also explored incorporating data on tobacco use from Our World in Data (https://ourworldindata.org/grapher/share-of-adults-who-are-smoking-by-level-of-prosperity). However, due to excessive missing values, I decided not to include tobacco use in the final analysis.
Because the dataset was still large (10,000+ observations), I narrowed the focus to a more specific subgroup: females aged 65–69, resulting in 456 observations. I also joined GDP per capita data (sourced from Our World in Data: https://ourworldindata.org/grapher/gdp-per-capita-worldbank) to enrich the dataset and provide socioeconomic context for the analysis.
The following packages were used:
## Import the necessary libraries
library(rsample)
library(dplyr)
library(brms)
library(ggplot2)
library(corrr)
library(tidyr)
library(glue)
library(priorsense)
library(mice)
library(priorsense)
options(mc.cores = parallel::detectCores()) # paralellize if possible
options(brms.file_refit = "on_change") # save the files if the model has changed
# nice plotting theme:
theme_set(theme_linedraw() +
theme(panel.grid = element_blank()))
# load the data and begin preprocessing
cvd <- read.csv("./CVD_death_percent_updated.csv")
# the trans-fat column contained 0s, which I reasoned were used in place of missing values.
# I replaced these with NA and then applied multiple imputation, allowing me to
# preserve the number of rows for model comparison and maintain the distribution of values.
# Since trans-fat intake was highly right-skewed and needed to be log-transformed, handling the 0s in this way was essential.
# in contrast, the tobacco use column only had entries for 2010 and 2015, leaving most values missing.
# because imputing such a large proportion would introduce too much noise, I excluded this variable.
# change zeros to NA
cvd$trans_fat_pct_kcal_day[cvd$trans_fat_pct_kcal_day == 0] <- NA
# multiple imputation for trans fat
miceOut <- mice(
data = cvd,
seed = 42,
m = 10,
maxit = 5,
method = "pmm"
)
iter imp variable
1 1 trans_fat_pct_kcal_day smoking_or_tobacco_use
1 2 trans_fat_pct_kcal_day smoking_or_tobacco_use
1 3 trans_fat_pct_kcal_day smoking_or_tobacco_use
1 4 trans_fat_pct_kcal_day smoking_or_tobacco_use
1 5 trans_fat_pct_kcal_day smoking_or_tobacco_use
1 6 trans_fat_pct_kcal_day smoking_or_tobacco_use
1 7 trans_fat_pct_kcal_day smoking_or_tobacco_use
1 8 trans_fat_pct_kcal_day smoking_or_tobacco_use
1 9 trans_fat_pct_kcal_day smoking_or_tobacco_use
1 10 trans_fat_pct_kcal_day smoking_or_tobacco_use
2 1 trans_fat_pct_kcal_day smoking_or_tobacco_use
2 2 trans_fat_pct_kcal_day smoking_or_tobacco_use
2 3 trans_fat_pct_kcal_day smoking_or_tobacco_use
2 4 trans_fat_pct_kcal_day smoking_or_tobacco_use
2 5 trans_fat_pct_kcal_day smoking_or_tobacco_use
2 6 trans_fat_pct_kcal_day smoking_or_tobacco_use
2 7 trans_fat_pct_kcal_day smoking_or_tobacco_use
2 8 trans_fat_pct_kcal_day smoking_or_tobacco_use
2 9 trans_fat_pct_kcal_day smoking_or_tobacco_use
2 10 trans_fat_pct_kcal_day smoking_or_tobacco_use
3 1 trans_fat_pct_kcal_day smoking_or_tobacco_use
3 2 trans_fat_pct_kcal_day smoking_or_tobacco_use
3 3 trans_fat_pct_kcal_day smoking_or_tobacco_use
3 4 trans_fat_pct_kcal_day smoking_or_tobacco_use
3 5 trans_fat_pct_kcal_day smoking_or_tobacco_use
3 6 trans_fat_pct_kcal_day smoking_or_tobacco_use
3 7 trans_fat_pct_kcal_day smoking_or_tobacco_use
3 8 trans_fat_pct_kcal_day smoking_or_tobacco_use
3 9 trans_fat_pct_kcal_day smoking_or_tobacco_use
3 10 trans_fat_pct_kcal_day smoking_or_tobacco_use
4 1 trans_fat_pct_kcal_day smoking_or_tobacco_use
4 2 trans_fat_pct_kcal_day smoking_or_tobacco_use
4 3 trans_fat_pct_kcal_day smoking_or_tobacco_use
4 4 trans_fat_pct_kcal_day smoking_or_tobacco_use
4 5 trans_fat_pct_kcal_day smoking_or_tobacco_use
4 6 trans_fat_pct_kcal_day smoking_or_tobacco_use
4 7 trans_fat_pct_kcal_day smoking_or_tobacco_use
4 8 trans_fat_pct_kcal_day smoking_or_tobacco_use
4 9 trans_fat_pct_kcal_day smoking_or_tobacco_use
4 10 trans_fat_pct_kcal_day smoking_or_tobacco_use
5 1 trans_fat_pct_kcal_day smoking_or_tobacco_use
5 2 trans_fat_pct_kcal_day smoking_or_tobacco_use
5 3 trans_fat_pct_kcal_day smoking_or_tobacco_use
5 4 trans_fat_pct_kcal_day smoking_or_tobacco_use
5 5 trans_fat_pct_kcal_day smoking_or_tobacco_use
5 6 trans_fat_pct_kcal_day smoking_or_tobacco_use
5 7 trans_fat_pct_kcal_day smoking_or_tobacco_use
5 8 trans_fat_pct_kcal_day smoking_or_tobacco_use
5 9 trans_fat_pct_kcal_day smoking_or_tobacco_use
5 10 trans_fat_pct_kcal_day smoking_or_tobacco_use
Warning: Number of logged events: 4
completed_data <- complete(miceOut, 1)
cvd$trans_fat_pct_kcal_day <- completed_data$trans_fat_pct_kcal_day
# Because I wanted to predict the percentage of deaths attributed to CVD relative to total deaths,
# I needed to fit a Beta regression model. This required the target variable to be bounded between 0 and 1.
# Therefore, I divided percent_of_cvd_deaths by 100 to transform it into a proportion.
# adjust the target variable for Beta regression
cvd$proportion_of_cvd_deaths <- cvd$percent_of_cvd_deaths / 100
# I also decided to add GDP per capita as a predictor to provide socioeconomic context.
# I merged this feature from a separate dataset sourced from Our World in Data.
gdp_data <- read.csv("https://ourworldindata.org/grapher/gdp-per-capita-worldbank.csv?v=1&csvType=full&useColumnShortNames=true")
gdp_data
Entity Code Year ny_gdp_pcap_pp_kd owid_region
1 Afghanistan AFG 2000 1617.826
2 Afghanistan AFG 2001 1454.111
3 Afghanistan AFG 2002 1774.309
4 Afghanistan AFG 2003 1815.928
5 Afghanistan AFG 2004 1776.918
6 Afghanistan AFG 2005 1908.115
7 Afghanistan AFG 2006 1929.724
8 Afghanistan AFG 2007 2155.353
9 Afghanistan AFG 2008 2191.504
10 Afghanistan AFG 2009 2565.022
11 Afghanistan AFG 2010 2848.586
12 Afghanistan AFG 2011 2757.052
13 Afghanistan AFG 2012 2985.319
14 Afghanistan AFG 2013 3046.580
15 Afghanistan AFG 2014 3017.943
16 Afghanistan AFG 2015 2967.692
17 Afghanistan AFG 2016 2958.785
18 Afghanistan AFG 2017 2952.999
19 Afghanistan AFG 2018 2902.392
20 Afghanistan AFG 2019 2927.245
21 Afghanistan AFG 2020 2769.686
22 Afghanistan AFG 2021 2144.166
23 Afghanistan AFG 2022 1981.710
24 Afghanistan AFG 2023 1992.424 Asia
[ reached 'max' / getOption("max.print") -- omitted 7086 rows ]
colnames(gdp_data)[colnames(gdp_data) == "Entity"] <- "country"
colnames(gdp_data)[colnames(gdp_data) == "Year"] <- "year"
colnames(gdp_data)[colnames(gdp_data) == "ny_gdp_pcap_pp_kd"] <- "gdp_per_capita"
cvd_merged <- merge(cvd, gdp_data[, c("country", "year", "gdp_per_capita")],
by = c("country", "year")
)
# confirm that the merge was successful
length(cvd_merged)
[1] 19
length(cvd)
[1] 18
nrow(cvd_merged) == nrow(cvd)
[1] TRUE
sum(is.na(cvd_merged$gdp_per_capita))
[1] 0
# all checks passed, so rename back to cvd
cvd <- cvd_merged
Categorical variables:
Country: 38 European countries
Sex: male, female
age_group: Adults in 5 year age ranges (e.g., 60-64 years).
region: 7 custom clusters of European countries by geography
Numeric variables:
year: years 2008 to 2019
trans_fat_pct_kcal_day: Daily individual percent calorie intake of trans fatty acids estimates
sodium_g_per_day: Daily individual intake of sodium in grams estimates
fiber_g_per_day: Daily individual intake of fiber in grams estimates
fruit_g_per_day: Daily individual intake of fruits in grams estimates
legumes_g_per day: Daily individual intake of legumes in grams estimates
mean_prev_overweight: Mean prevalence of overweight individuals
mean_change_overweight: Mean 5 year change in prevalence of overweight individuals
mean_prev_obese: Mean prevalence of obese individuals
mean_change_obese: Mean 5 year change in prevalence of obese individuals
smoking_or_tobacco_use: Estimated percentage of adults aged 15 years and older who currently smoke or use tobacco. This includes smokeless products, such as chewing or snuffing tobacco, but excludes products that do not contain tobacco, such as e-cigarettes.
gdp_billions: gross domestic product (GDP) in billions of USD
gdp_per_capita: share of country’s GDP per person
proportion_of_cvd_deaths: proportion of total deaths caused by cardiovascular disease (CVD) relative to all other causes of death
Also, numerous columns needed to be log transformed due to the right-skewedness of their distributions, which can be seen in the code of question 2.
# Numerous columns needed to be log transformed due to the right-skewedness of their distributions, which can be seen with this code:
par(mfrow = c(2, 2))
numeric_columns <- sapply(cvd, is.numeric) & names(cvd) != "year"
for (col in names(cvd)[numeric_columns]) {
hist(cvd[[col]],
main = paste(col),
xlab = col,
col = "red",
breaks = 30
)
}
mtext("Distribution of Target & Features", outer = TRUE, cex = 1.5, font = 2)
# transforming and scaling the data
cvd <- cvd %>%
mutate(
s_year = scale(year),
s_log_trans_fat_pct_kcal_day = scale(log(trans_fat_pct_kcal_day)),
s_log_sodium_g_per_day = scale(log(sodium_g_per_day)),
s_log_fiber_g_per_day = scale(log(fiber_g_per_day)),
s_log_fruit_g_per_day = scale(log(fruit_g_per_day)),
s_log_legumes_g_per_day = scale(log(legumes_g_per_day)),
s_mean_prev_overweight = scale(mean_prev_overweight),
s_mean_change_overweight = scale(mean_change_overweight),
s_log_mean_prev_obese = scale(log(mean_prev_obese)),
s_log_mean_change_obese = scale(log(mean_change_obese)),
s_smoking_or_tobacco_use = scale(smoking_or_tobacco_use),
s_log_gdp = scale(log(gdp_billions)),
s_log_gdp_per_cap = scale(log(gdp_per_capita))
)
# filtering the data to only include females in the 65-69 age group
data_female_65_69 <- cvd %>%
filter(sex == "Female", age_group == "65-69 years")
# Split the dataset
split <- initial_split(data = data_female_65_69, prop = 0.8, strata = proportion_of_cvd_deaths)
train_data_female_65_69 <- training(split)
test_data_female_65_69 <- testing(split)
mod_1 fits a Beta regression to predict the target variable. In addition to using years, trans-fats per day and mean previous obesity rate, the model also used log GDP per capita as a predictor. This was done to compare its effect to log GDP, since it is possible that countries/regions with a lower GDP than others may have a higher share of GDP per person, which in turn could account for more variance in our target variable. Both metrics were not included due to high potential for multicolinearity. Additionally, (1 | region/country) was included as a hierarchical effect, which takes into account that countries are nested in regions — helping to model the possibility that countries in a region are more similar to each other than countries in other regions. Therefore, intercepts are generated by region, as well as by countries within a region. The model takes default BRMS priors.
mod_2 likewise fits a beta regression model with default priors to predict the probability of the proportion of deaths attributed to CVD out of all deaths in females aged 65-59 in Europe. The model predicts probability of CVD death percentage using year, the mean prevalence of obesity, grams of sodium per day, grams of fiber per day, number of calories from trans fat per day, and the interaction between fiber and trans fat. The model makes predictions using shared information in the country cluster, therefore the model assumes that the country affects CVD death proportion differently, but that the data for each country comes from the same distribution.
mod_1 <- brm(
formula = proportion_of_cvd_deaths ~ s_year + s_log_trans_fat_pct_kcal_day +
s_log_mean_prev_obese + s_log_gdp_per_cap +
(1 | region / country),
data = train_data_female_65_69,
family = Beta(link = "logit"),
control = list(adapt_delta = 0.9, max_treedepth = 15),
seed = 42,
sample_prior = TRUE,
save_pars = save_pars(all = TRUE),
file = "mod_1"
)
mod_2 <- brm(
formula = proportion_of_cvd_deaths ~ s_year + s_log_mean_prev_obese +
s_log_sodium_g_per_day +
s_log_fiber_g_per_day * s_log_trans_fat_pct_kcal_day +
(1 | country),
data = train_data_female_65_69,
family = Beta(link = logit),
seed = 42,
sample_prior = TRUE,
save_pars = save_pars(all = TRUE),
file = "mod_2"
)
The initial model showed convergence, as indicated by Rhat values all equal to 1 for all intercepts (normal and multilevel), fixed effects, as well as for phi. Both bulk and tail effective sample sizes (ESS) were quite high for all parameters, with the lowest ESS being a bulk ESS of 1,737, indicating good posterior estimation. Additionally, tail ESS was never far lower than bulk ESS, which would indicate that the model explored the tails poorly, leading to grossly inaccurate CIs and poor handling of outliers. Finally, the traceplots did not show funneling or consistent spikes and were generally well mixed. However, I received a warning that there were 7 divergent transitions after warm-up. While this isn’t many, it still indicated that there could have been issues in exploring the posterior. Therefore, I added “adapt_delta = 0.9”, since R suggested I increase it from its default of 0.8. Doing so would decrease the step size of Monte Carlo, helping to prevent divergences. I also received a warning that maximum tree depth should be increased, so I set max_treedepth = 15, thereby increasing the number of steps taken per iteration.
The second model shows convergence with Rhat values of 1.00 for most parameters, but with Rhat 1.01 for log sodium intake and the mulitlevel intercept for country. The bulk and tail effective sample sizes (ESS) for some effects were less than 1000, which is acceptable, but overall the ESS values are high, indicating rather robust results for the posterior estimates. The traceplots are well mixed and overlapping for all parameters, with no indication of divergent transitions. The density plots are smooth and unimodal with strong overlap between chains.
# checking model outputs
mod_1
Warning: There were 5 divergent transitions after warmup. Increasing adapt_delta above
0.9 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Family: beta
Links: mu = logit; phi = identity
Formula: proportion_of_cvd_deaths ~ s_year + s_log_trans_fat_pct_kcal_day + s_log_mean_prev_obese + s_log_gdp_per_cap + (1 | region/country)
Data: train_data_female_65_69 (Number of observations: 364)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Multilevel Hyperparameters:
~region (Number of levels: 7)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 0.80 0.31 0.42 1.57 1.00 1525 2151
~region:country (Number of levels: 38)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 0.25 0.04 0.19 0.33 1.00 1371 1983
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -0.92 0.32 -1.54 -0.26 1.00 2575 2521
s_year -0.11 0.01 -0.12 -0.09 1.00 3783 2746
s_log_trans_fat_pct_kcal_day 0.01 0.01 -0.01 0.03 1.00 5783 2727
s_log_mean_prev_obese 0.15 0.06 0.04 0.26 1.00 3586 2631
s_log_gdp_per_cap -0.04 0.02 -0.09 0.00 1.00 5455 2738
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
phi 1610.85 124.90 1369.59 1864.09 1.00 4194 2804
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
mod_2
Family: beta
Links: mu = logit; phi = identity
Formula: proportion_of_cvd_deaths ~ s_year + s_log_mean_prev_obese + s_log_sodium_g_per_day + s_log_fiber_g_per_day * s_log_trans_fat_pct_kcal_day + (1 | country)
Data: train_data_female_65_69 (Number of observations: 364)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Multilevel Hyperparameters:
~country (Number of levels: 38)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 0.57 0.08 0.44 0.73 1.01 664 961
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI
Intercept -1.19 0.12 -1.42 -0.95
s_year -0.13 0.01 -0.15 -0.12
s_log_mean_prev_obese 0.27 0.07 0.14 0.42
s_log_sodium_g_per_day 0.04 0.06 -0.08 0.15
s_log_fiber_g_per_day 0.14 0.04 0.07 0.21
s_log_trans_fat_pct_kcal_day 0.01 0.01 -0.01 0.03
s_log_fiber_g_per_day:s_log_trans_fat_pct_kcal_day -0.00 0.01 -0.02 0.02
Rhat Bulk_ESS Tail_ESS
Intercept 1.00 787 1106
s_year 1.00 982 1444
s_log_mean_prev_obese 1.00 989 1152
s_log_sodium_g_per_day 1.01 1276 1585
s_log_fiber_g_per_day 1.00 1656 1872
s_log_trans_fat_pct_kcal_day 1.00 2548 2228
s_log_fiber_g_per_day:s_log_trans_fat_pct_kcal_day 1.00 2676 2349
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
phi 1612.25 128.99 1367.00 1872.06 1.00 2023 1859
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# Trace and density plots
plot(mod_1)
plot(mod_2)
# Powerscale sensitivity analysis
powerscale_sensitivity(mod_1)
Sensitivity based on cjs_dist
Prior selection: all priors
Likelihood selection: all data
variable prior likelihood
b_Intercept 0.009 0.056
b_s_year 0.015 0.164
b_s_log_trans_fat_pct_kcal_day 0.015 0.111
b_s_log_mean_prev_obese 0.022 0.212
b_s_log_gdp_per_cap 0.016 0.139
sd_region__Intercept 0.028 0.089
sd_region:country__Intercept 0.010 0.066
phi 0.209 0.485
Intercept 0.010 0.028
r_region[Balkans,Intercept] 0.013 0.026
r_region[Baltic,Intercept] 0.010 0.020
r_region[Central.Europe,Intercept] 0.010 0.023
r_region[Eastern.Europe,Intercept] 0.009 0.033
r_region[Nordic,Intercept] 0.007 0.050
r_region[Southern.Europe,Intercept] 0.009 0.042
r_region[Western.Europe,Intercept] 0.008 0.043
r_region:country[Balkans_Albania,Intercept] 0.009 0.052
r_region:country[Balkans_Bosnia.and.Herzegovina,Intercept] 0.006 0.038
r_region:country[Balkans_Bulgaria,Intercept] 0.008 0.051
r_region:country[Balkans_Croatia,Intercept] 0.008 0.049
r_region:country[Balkans_Montenegro,Intercept] 0.007 0.051
r_region:country[Balkans_North.Macedonia,Intercept] 0.006 0.042
r_region:country[Balkans_Serbia,Intercept] 0.008 0.042
r_region:country[Baltic_Estonia,Intercept] 0.008 0.030
r_region:country[Baltic_Latvia,Intercept] 0.010 0.041
r_region:country[Baltic_Lithuania,Intercept] 0.008 0.034
r_region:country[Central.Europe_Austria,Intercept] 0.014 0.141
r_region:country[Central.Europe_Czechia,Intercept] 0.006 0.027
r_region:country[Central.Europe_Hungary,Intercept] 0.011 0.062
r_region:country[Central.Europe_Poland,Intercept] 0.006 0.033
diagnosis
-
-
-
-
-
-
-
potential prior-data conflict
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
[ reached 'max' / getOption("max.print") -- omitted 73 rows ]
powerscale_sensitivity(mod_2)
Sensitivity based on cjs_dist
Prior selection: all priors
Likelihood selection: all data
variable prior likelihood
b_Intercept 0.011 0.108
b_s_year 0.038 0.399
b_s_log_mean_prev_obese 0.043 0.421
b_s_log_sodium_g_per_day 0.016 0.231
b_s_log_fiber_g_per_day 0.016 0.232
b_s_log_trans_fat_pct_kcal_day 0.015 0.158
b_s_log_fiber_g_per_day:s_log_trans_fat_pct_kcal_day 0.017 0.140
sd_country__Intercept 0.021 0.179
phi 0.221 0.479
Intercept 0.013 0.045
r_country[Albania,Intercept] 0.021 0.197
r_country[Austria,Intercept] 0.010 0.093
r_country[Belarus,Intercept] 0.015 0.102
r_country[Belgium,Intercept] 0.013 0.156
r_country[Bosnia.and.Herzegovina,Intercept] 0.026 0.283
r_country[Bulgaria,Intercept] 0.020 0.156
r_country[Croatia,Intercept] 0.019 0.168
r_country[Cyprus,Intercept] 0.012 0.202
r_country[Czechia,Intercept] 0.020 0.179
r_country[Denmark,Intercept] 0.013 0.177
r_country[Estonia,Intercept] 0.010 0.051
r_country[Finland,Intercept] 0.006 0.083
r_country[France,Intercept] 0.015 0.217
r_country[Germany,Intercept] 0.010 0.122
r_country[Greece,Intercept] 0.009 0.035
r_country[Hungary,Intercept] 0.026 0.256
r_country[Iceland,Intercept] 0.009 0.135
r_country[Ireland,Intercept] 0.011 0.130
r_country[Italy,Intercept] 0.017 0.186
r_country[Latvia,Intercept] 0.012 0.064
diagnosis
-
-
-
-
-
-
-
-
potential prior-data conflict
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
[ reached 'max' / getOption("max.print") -- omitted 59 rows ]
mod_1: The powerscale sensitivity function found a potential prior-data conflict with the phi variable. The large discrepancy between prior and likelihood values for this variable indicate a severe mismatch between the priors and the actual posterior distribution. I reasoned that the default priors did not adequately account for phi. Also, the estimated phi was quite large (CI [1369.59, 1864.09]), so I decided to use a lognormal distribution of (log(1800), 0.25) for phi, amounting to CI[1385, 2400] to better capture the range I found. After running with this prior, I no longer had any warnings, and the CI range of my phi value was within the bounds of my prior.
mod_2: The powerscale sensitivity analysis likewise revealed a potential prior-data conflict with the default phi prior. I similarly adjusted the prior for phi to be (log(1800), 0.25) so that the CI of phi would be within this range as well, and also no longer had any warnings
# adjust both models
mod_1_lognormal <- brm(
formula = proportion_of_cvd_deaths ~
s_year +
s_log_trans_fat_pct_kcal_day +
s_log_mean_prev_obese +
s_log_gdp_per_cap +
(1 | region / country),
data = train_data_female_65_69,
prior = prior(lognormal(log(1800), 0.25), class = "phi"), # adjusted priors for phi
control = list(adapt_delta = 0.9, max_treedepth = 15),
family = Beta(link = "logit"),
seed = 42,
save_pars = save_pars(all = TRUE),
file = "mod_1_lognormal"
)
mod_2_lognormal <- brm(
formula = proportion_of_cvd_deaths ~ s_year + s_log_mean_prev_obese +
s_log_sodium_g_per_day +
s_log_fiber_g_per_day * s_log_trans_fat_pct_kcal_day +
(1 | country),
data = train_data_female_65_69,
prior = prior(lognormal(log(1800), 0.25), class = "phi"), # adjusted priors for phi
control = list(adapt_delta = 0.9, max_treedepth = 15),
family = Beta(link = logit),
seed = 42,
sample_prior = TRUE,
save_pars = save_pars(all = TRUE),
file = "mod_2_lognormal"
)
powerscale_sensitivity(mod_1_lognormal)
Sensitivity based on cjs_dist
Prior selection: all priors
Likelihood selection: all data
variable prior likelihood diagnosis
b_Intercept 0.009 0.054 -
b_s_year 0.001 0.153 -
b_s_log_trans_fat_pct_kcal_day 0.001 0.135 -
b_s_log_mean_prev_obese 0.002 0.205 -
b_s_log_gdp_per_cap 0.001 0.131 -
sd_region__Intercept 0.024 0.074 -
sd_region:country__Intercept 0.002 0.041 -
phi 0.017 0.250 -
Intercept 0.010 0.035 -
r_region[Balkans,Intercept] 0.010 0.017 -
r_region[Baltic,Intercept] 0.009 0.016 -
r_region[Central.Europe,Intercept] 0.009 0.022 -
r_region[Eastern.Europe,Intercept] 0.010 0.016 -
r_region[Nordic,Intercept] 0.009 0.045 -
r_region[Southern.Europe,Intercept] 0.009 0.041 -
r_region[Western.Europe,Intercept] 0.009 0.045 -
r_region:country[Balkans_Albania,Intercept] 0.001 0.044 -
r_region:country[Balkans_Bosnia.and.Herzegovina,Intercept] 0.001 0.017 -
r_region:country[Balkans_Bulgaria,Intercept] 0.000 0.038 -
r_region:country[Balkans_Croatia,Intercept] 0.001 0.046 -
r_region:country[Balkans_Montenegro,Intercept] 0.001 0.037 -
r_region:country[Balkans_North.Macedonia,Intercept] 0.001 0.019 -
r_region:country[Balkans_Serbia,Intercept] 0.001 0.026 -
r_region:country[Baltic_Estonia,Intercept] 0.001 0.028 -
r_region:country[Baltic_Latvia,Intercept] 0.001 0.020 -
r_region:country[Baltic_Lithuania,Intercept] 0.001 0.020 -
r_region:country[Central.Europe_Austria,Intercept] 0.000 0.107 -
r_region:country[Central.Europe_Czechia,Intercept] 0.002 0.031 -
r_region:country[Central.Europe_Hungary,Intercept] 0.002 0.053 -
r_region:country[Central.Europe_Poland,Intercept] 0.002 0.032 -
[ reached 'max' / getOption("max.print") -- omitted 69 rows ]
powerscale_sensitivity(mod_2_lognormal)
Sensitivity based on cjs_dist
Prior selection: all priors
Likelihood selection: all data
variable prior likelihood diagnosis
b_Intercept 0.001 0.151 -
b_s_year 0.002 0.360 -
b_s_log_mean_prev_obese 0.002 0.362 -
b_s_log_sodium_g_per_day 0.001 0.208 -
b_s_log_fiber_g_per_day 0.001 0.222 -
b_s_log_trans_fat_pct_kcal_day 0.001 0.127 -
b_s_log_fiber_g_per_day:s_log_trans_fat_pct_kcal_day 0.001 0.069 -
sd_country__Intercept 0.003 0.226 -
phi 0.017 0.327 -
Intercept 0.003 0.042 -
r_country[Albania,Intercept] 0.003 0.142 -
r_country[Austria,Intercept] 0.001 0.140 -
r_country[Belarus,Intercept] 0.002 0.082 -
r_country[Belgium,Intercept] 0.001 0.184 -
r_country[Bosnia.and.Herzegovina,Intercept] 0.003 0.238 -
r_country[Bulgaria,Intercept] 0.003 0.093 -
r_country[Croatia,Intercept] 0.003 0.111 -
r_country[Cyprus,Intercept] 0.001 0.221 -
r_country[Czechia,Intercept] 0.003 0.122 -
r_country[Denmark,Intercept] 0.001 0.208 -
r_country[Estonia,Intercept] 0.002 0.041 -
r_country[Finland,Intercept] 0.002 0.103 -
r_country[France,Intercept] 0.001 0.229 -
r_country[Germany,Intercept] 0.002 0.139 -
r_country[Greece,Intercept] 0.002 0.052 -
r_country[Hungary,Intercept] 0.003 0.211 -
r_country[Iceland,Intercept] 0.002 0.142 -
r_country[Ireland,Intercept] 0.001 0.153 -
r_country[Italy,Intercept] 0.001 0.206 -
r_country[Latvia,Intercept] 0.002 0.057 -
[ reached 'max' / getOption("max.print") -- omitted 59 rows ]
mod_1_lognormal: The density plot shows that the draws for posterior predictive distribution conform well to the shape of the observed data, save for not being able to capture the local maximum at x=0.5. However, this is quite marginal, as the trend of the observed data is captured well in general. Additionally, the scatter plot of observed vs. predicted values shows the mean and standard deviation of the observed data to be roughly in the center of the cloud of the means and standard deviations of the predicted data. Taken together, it appears that the model predicts the data quite well. Finally, the observed data seem to fit well overall in the predictions with respect to log GDP per capita, except for in a few countries (which appear as snake-like clusters in the “intervals” plot).
mod_2_lognormal: The scatter plot of the observed versus predicted values depicts the mean and standard deviation of the observed data to fall in the center of the predicted means/standard deviations. Also, the draws for the density plot all follow the shape of the observed data distribution closely. Therefore, the model seems to fit the data well.
# Posterior Predictive Checks for first model
pp_check(mod_1_lognormal,
type = "intervals",
x = "s_log_gdp_per_cap",
prob_outer = .95
) +
xlab("s_log_gdp_per_cap")
pp_check(mod_1_lognormal, ndraws = 200)
pp_check(mod_1_lognormal, type = "stat_2d")
# Posterior Predictive Checks for second model
pp_check(mod_2_lognormal,
type = "intervals",
x = "s_log_mean_prev_obese",
prob_outer = .95
)
pp_check(mod_2_lognormal, ndraws = 200)
pp_check(mod_2_lognormal, type = "stat_2d")
# Check size of models
nrow(train_data_female_65_69)
[1] 364
nobs(mod_1_lognormal)
[1] 364
nobs(mod_2_lognormal)
[1] 364
# k-fold cross-validation
set.seed(42)
kfold_folds <- loo::kfold_split_random(
K = 10,
N = nrow(train_data_female_65_69)
)
set.seed(42)
kf_mod_1 <- kfold(mod_1_lognormal, chains = 1, folds = kfold_folds, save_fits = TRUE)
kf_mod_2 <- kfold(mod_2_lognormal, chains = 1, folds = kfold_folds, save_fits = TRUE)
data.frame(
model = c("Model 1", "Model 2"),
elpd_kfold = c(
kf_mod_1$estimates["elpd_kfold", "Estimate"],
kf_mod_2$estimates["elpd_kfold", "Estimate"]
),
se = c(
kf_mod_1$estimates["elpd_kfold", "SE"],
kf_mod_2$estimates["elpd_kfold", "SE"]
)
)
model elpd_kfold se
1 Model 1 1130.322 16.17500
2 Model 2 1124.900 15.43943
After using 10-fold cross-validation to compare the models, both were found to perform very well on predictive accuracy, with an expected log predictive density (ELPD) of 1130.32 and 1124.90, respectively. Additionally, model 1 had SE =16.17 and model 2 had SE=15.44, meaning that the difference in ELPD between the two models are not significantly different from each other, since their difference is well below each of their SEs. Therefore, both models are likely to generalize well on unseen data. However, as model 1 had Rhats of 1.00 and ESS > 1,000, this model seems more dependable than model 2, which had two Rhats of 1.01 and some ESS values < 1,000. Therefore, the estimates for model 1 are slightly more stable, and it’s parameters will be discussed in the following section.
All numerical features included in model 1 were scaled and centered. This model included two health-related features — log-transformed obesity prevalence, and log-transformed average calories of trans fats consumed per day — as well as year to capture the decreasing trend observed in proportion of CVD deaths from 2008 to 2019. This model also included log-transformed GDP per capita, to test whether the average economic conditions for our chosen demographic (women aged 65-69 years) would predict proportion of CVD deaths. Finally, to account for the observed variation in proportion of CVD deaths by country as well as by region, the model included the nested grouping factor of (1 | region/country). This allowed for partial pooling to explore potential differences between regions, as well as between countries after controlling for region.
*Note: Because the model fit a Beta regression with a logit link, the coefficients in the output are in log-odds. Therefore, estimated proportions will be provided for added clarity when appropriate.
The intercept estimate was -0.91, with a CI [-1.56, -0.23]. Considering the centered predictors, this estimate represents the baseline proportion of CVD deaths in our target demographic with the average values for log obesity prevalence, log trans fat calories per day, and log gdp per capita in the year 2013. In proportions, the estimate is 0.29.
The fixed effect estimate for the scaled year was -0.11, with a CI [-0.12, -0.09]. The confidence interval contains exclusively large negative log-odds values, meaning that we can be reasonably certain of a negative effect. Therefore, it is estimated that every unit change in scaled year corresponds to a 0.11 decrease in log-odds of proportion of CVD deaths, when other predictors are at their mean. For example, a unit increase in scaled year would lead to an estimated log-odds -1.02, or a proportion of 0.27.
The fixed effect estimate for the scaled log of trans fat calories per day was 0.01, with a CI [-0.00, 0.03]. The confidence interval includes 0, so we cannot be certain of a positive effect on log trans-fat consumption on proportion of CVD death rate after controlling for other predictors.
The fixed effect estimate for the scaled log of obesity prevalence was 0.15, with a CI [0.03, 0.25]. The confidence interval suggests that we can be reasonably certain of an estimated positive effect of log obesity prevalence on CVD death proportion when controlling for other predictors. Therefore, a unit increase in log of obesity prevalence from the scaled mean is estimated to have a log odds of -0.76 (or a proportion of 0.32).
The intercept estimate for the multilevel effect of region (with 7 levels) was 0.80, with a CI [0.41, 1.57]. The log-odds represented in the CI are positive and quite high, which suggests significant variation between regions. Therefore, the estimated intercepts per region differ from the overall intercept by an average of 0.80 log-odds. Given the main intercept of -0.91, the log-odds range of the differences would be (-1.71, -0.11). Translated to proportions of CVD deaths, this would be (0.15, 0.47).
The intercept estimate for the nested multilevel effect of country by region (with 38 levels) was 0.25, with a CI [0.20, 0.33]. This CI was less extreme than for region alone, but nonetheless suggests a strong variation between countries when controlling for region. The estimated intercepts per country, controlling for region, differ from the overall intercept by log-odds of 0.25. Considering the main intercept of -0.91, the log-odds range of the differences would be (-1.16, -0.66), or (0.24, 0.34) when translated to proportion of CVD deaths.
Finally, phi was estimated to be 1792.74 with a CI[1537.02, 2075.05]. This parameter corresponds to the degree to which the predictions are dispersed. As higher phi values indicate lower dispersion, the estimated phi value suggests very low dispersion, indicating that predictions are densely clustered around the mean.
# Get predicted and observed values
fitted_vals <- fitted(mod_1_lognormal, newdata = test_data_female_65_69)
pred_vals <- fitted_vals[, "Estimate"]
observed_vals <- test_data_female_65_69$proportion_of_cvd_deaths
# Calculate RMSE and MAE
rmse <- sqrt(mean((pred_vals - observed_vals)^2))
mae <- mean(abs(pred_vals - observed_vals))
cat("test RMSE:", round(rmse, 4))
test RMSE: 0.0116
cat("\n")
cat("test MAE:", round(mae, 4))
test MAE: 0.0082
# Plotting predicted and observed values
predictions_plot <- data.frame(
observed = observed_vals,
predicted = pred_vals
)
ggplot(predictions_plot, aes(x = observed, y = predicted)) +
geom_point(alpha = 0.8, color = "darkgreen") +
geom_abline(slope = 1, intercept = 0, color = "darkred", linetype = "dashed") +
labs(
title = "Proportion of CVD Deaths: Predicted vs. Observed Values",
x = "Observed", y = "Predicted"
) +
theme_bw()
Institute for Health Metrics and Evaluation. (n.d.). GBD Results Tool. Viz Hub. https://vizhub.healthdata.org/gbd-results/
Our World in Data. (n.d.). Share of adults who are smoking by level of prosperity [Data set]. Based on data from World Health Organization (2024); World Bank (2025); HYDE (2023); Gapminder – Population v7 (2022); UN World Population Prospects (2024); Gapminder – Systema Globalis (2022). https://ourworldindata.org/grapher/share-of-adults-who-are-smoking-by-level-of-prosperity
Our World in Data. (n.d.). GDP per capita (World Bank) [Data set]. Based on data compiled from multiple sources by the World Bank (2025), with minor processing by Our World in Data. https://ourworldindata.org/grapher/gdp-per-capita-worldbank